Automatic Numbers Normalization in Inflectional Languages
Abstract
This paper is devoted to the text normalization module of our text-to-speech synthesis system. We focus on converting numerals written as digits into their readable full-length word form. Numeral conversion is a significant issue in inflectional languages such as Czech, Russian, or Slovak, because morphological and semantic information is needed to make the conversion unambiguous. Three part-of-speech tagging methods are compared. Furthermore, we present a method that reduces the tagset in order to increase the accuracy of numeral conversion.
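To illustrate why morphological information matters here, the following is a minimal sketch of case-dependent numeral expansion for Czech. It is not the method of the paper: the lexicon `FORMS`, the case labels, and the functions `expand_numeral` and `normalize` are hypothetical, and the case tag of each numeral is assumed to come from an upstream part-of-speech tagger. Gender distinctions (for example "dva" versus "dvě") are ignored for brevity.

```python
# Hypothetical toy lexicon: Czech word forms of a few numerals by grammatical case.
# Masculine forms only; a real system would also need gender and larger numbers.
FORMS = {
    2: {"nom": "dva", "gen": "dvou", "dat": "dvěma",
        "acc": "dva", "loc": "dvou", "ins": "dvěma"},
    3: {"nom": "tři", "gen": "tří", "dat": "třem",
        "acc": "tři", "loc": "třech", "ins": "třemi"},
    5: {"nom": "pět", "gen": "pěti", "dat": "pěti",
        "acc": "pět", "loc": "pěti", "ins": "pěti"},
}


def expand_numeral(token, case):
    """Return the word form of a digit token for the given case tag.

    Falls back to the unchanged token when the lexicon has no entry.
    """
    try:
        return FORMS[int(token)][case]
    except (ValueError, KeyError):
        return token


def normalize(tagged_tokens):
    """Expand digit tokens in a list of (token, case_tag) pairs.

    The case tag is assumed to be supplied by a tagger; it is ignored
    for tokens that are not written as digits.
    """
    words = []
    for token, case in tagged_tokens:
        if token.isdigit():
            words.append(expand_numeral(token, case))
        else:
            words.append(token)
    return " ".join(words)


# The same digit "5" is read differently depending on its case.
print(normalize([("bez", None), ("5", "gen"), ("korun", None)]))
print(normalize([("5", "nom"), ("korun", None)]))
```

In this sketch only the case tag drives the choice of word form, which hints at why a reduced tagset may be sufficient for the conversion task: tag distinctions that never change the numeral's form can be merged without losing the information the lookup needs.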
Related resources
An Improved Stemming Approach Using HMM for a Highly Inflectional Language
Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High quality stemming is difficult in highly inflectional Indic languages. Little research has been performed on designing algorithms for stemming of texts in Indic languages. In this study, we ...
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of stan...
Constructional Potentiality: Priscianic grammar as a disambiguation technique in the automatic recognition of Latin syntax
In most languages, word order plays the major role in determining which words form a single phrase or constituent. A tree structure can be abstracted automatically from a sentence by linear determination of the major syntactic constituents. However, in certain highly-inflected languages, of which Latin is perhaps the most extreme example, cons...
An Approach to Lexical Development for Inflectional Languages
We describe a method for the semi-automatic development of morphological lexicons. The method aims at using minimal pre-existing resources and only relies upon the existence of a raw text corpus and a database of inflectional classes. No lexicon or list of base forms is assumed. The method is based on a contrastive approach, which generates hypothetical entries based on evidence drawn from a co...
Automatic Identification of Learners' Language Background Based on Their Writing in Czech
The goal of this study is to investigate whether learners’ written data in highly inflectional Czech can suggest a consistent set of clues for automatic identification of the learners’ L1 background. For our experiments, we use texts written by learners of Czech, which have been automatically and manually annotated for errors. We define two classes of learners: speakers of Indo-European languag...